{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "origin_pos": 0
   },
   "source": [
    "# Attention Scoring Functions\n",
    ":label:`sec_attention-scoring-functions`\n",
    "\n",
    "In :numref:`sec_nadaraya-waston`,\n",
    "we used a Gaussian kernel to model\n",
    "interactions between queries and keys.\n",
    "Treating the exponent of the Gaussian kernel\n",
    "in :eqref:`eq_nadaraya-waston-gaussian`\n",
    "as an *attention scoring function* (or *scoring function* for short),\n",
    "the results of this function were\n",
    "essentially fed into\n",
    "a softmax operation.\n",
    "As a result,\n",
    "we obtained\n",
    "a probability distribution (attention weights)\n",
    "over values that are paired with keys.\n",
    "In the end,\n",
    "the output of the attention pooling\n",
    "is simply a weighted sum of the values\n",
    "based on these attention weights.\n",
    "\n",
    "At a high level,\n",
    "we can use the above algorithm\n",
    "to instantiate the framework of attention mechanisms\n",
    "in :numref:`fig_qkv`.\n",
    "Denoting an attention scoring function by $a$,\n",
    ":numref:`fig_attention_output`\n",
    "illustrates how the output of attention pooling\n",
    "can be computed as a weighted sum of values.\n",
    "Since attention weights are\n",
    "a probability distribution,\n",
    "the weighted sum is essentially\n",
    "a weighted average.\n",
    "\n",
    "![Computing the output of attention pooling as a weighted average of values.](https://raw.githubusercontent.com/d2l-ai/d2l-en/master/img/attention-output.svg)\n",
    ":label:`fig_attention_output`\n",
    "\n",
    "\n",
    "\n",
    "Mathematically,\n",
    "suppose that we have\n",
    "a query $\\mathbf{q} \\in \\mathbb{R}^q$\n",
    "and $m$ key-value pairs $(\\mathbf{k}_1, \\mathbf{v}_1), \\ldots, (\\mathbf{k}_m, \\mathbf{v}_m)$, where any $\\mathbf{k}_i \\in \\mathbb{R}^k$ and any $\\mathbf{v}_i \\in \\mathbb{R}^v$.\n",
    "The attention pooling $f$\n",
    "is instantiated as a weighted sum of the values:\n",
    "\n",
    "$$f(\\mathbf{q}, (\\mathbf{k}_1, \\mathbf{v}_1), \\ldots, (\\mathbf{k}_m, \\mathbf{v}_m)) = \\sum_{i=1}^m \\alpha(\\mathbf{q}, \\mathbf{k}_i) \\mathbf{v}_i \\in \\mathbb{R}^v,$$\n",
    ":eqlabel:`eq_attn-pooling`\n",
    "\n",
    "where\n",
    "the attention weight (scalar) for the query $\\mathbf{q}$\n",
    "and key $\\mathbf{k}_i$\n",
    "is computed by\n",
    "the softmax operation of\n",
    "an attention scoring function $a$ that maps two vectors to a scalar:\n",
    "\n",
    "$$\\alpha(\\mathbf{q}, \\mathbf{k}_i) = \\mathrm{softmax}(a(\\mathbf{q}, \\mathbf{k}_i)) = \\frac{\\exp(a(\\mathbf{q}, \\mathbf{k}_i))}{\\sum_{j=1}^m \\exp(a(\\mathbf{q}, \\mathbf{k}_j))} \\in \\mathbb{R}.$$\n",
    ":eqlabel:`eq_attn-scoring-alpha`\n",
    "\n",
    "As we can see,\n",
    "different choices of the attention scoring function $a$\n",
    "lead to different behaviors of attention pooling.\n",
    "In this section,\n",
    "we introduce two popular scoring functions\n",
    "that we will use to develop more\n",
    "sophisticated attention mechanisms later.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load ../utils/djl-imports\n",
    "%load ../utils/plot-utils\n",
    "%load ../utils/Functions.java\n",
    "%load ../utils/PlotUtils.java"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "NDManager manager = NDManager.newBaseManager();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "origin_pos": 3
   },
   "source": [
    "## Masked Softmax Operation\n",
    "\n",
    "As we just mentioned,\n",
    "a softmax operation is used to\n",
    "output a probability distribution as attention weights.\n",
    "In some cases,\n",
    "not all the values should be fed into attention pooling.\n",
    "For instance,\n",
    "for efficient minibatch processing in :numref:`sec_machine_translation`,\n",
    "some text sequences are padded with\n",
    "special tokens that do not carry meaning.\n",
    "To get an attention pooling\n",
    "over\n",
    "only meaningful tokens as values,\n",
    "we can specify a valid sequence length (in number of tokens)\n",
    "to filter out those beyond this specified range\n",
    "when computing softmax.\n",
    "In this way,\n",
    "we can implement such a *masked softmax operation*\n",
    "in the following `masked_softmax` function,\n",
    "where any value beyond the valid length\n",
    "is masked as zero.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "public static NDArray maskedSoftmax(NDArray X, NDArray validLens) {\n",
    "    /* Perform softmax operation by masking elements on the last axis. */\n",
    "    // `X`: 3D NDArray, `validLens`: 1D or 2D NDArray\n",
    "    if (validLens == null) {\n",
    "        return X.softmax(-1);\n",
    "    }\n",
    "    \n",
    "    Shape shape = X.getShape();\n",
    "    if (validLens.getShape().dimension() == 1) {\n",
    "        validLens = validLens.repeat(shape.get(1));\n",
    "    } else {\n",
    "        validLens = validLens.reshape(-1);\n",
    "    }\n",
    "    // On the last axis, replace masked elements with a very large negative\n",
    "    // value, whose exponentiation outputs 0\n",
    "    X = X.reshape(new Shape(-1, shape.get(shape.dimension() - 1)))\n",
    "            .sequenceMask(validLens, (float) -1E6);\n",
    "    return X.softmax(-1).reshape(shape);\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "origin_pos": 6
   },
   "source": [
    "To demonstrate how this function works,\n",
    "consider a minibatch of two $2 \\times 4$ matrix examples,\n",
    "where the valid lengths for these two examples\n",
    "are two and three, respectively.\n",
    "As a result of the masked softmax operation,\n",
    "values beyond the valid lengths\n",
    "are all masked as zero.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "maskedSoftmax(\n",
    "        manager.randomUniform(0, 1, new Shape(2, 2, 4)),\n",
    "        manager.create(new float[] {2, 3}));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "origin_pos": 9
   },
   "source": [
    "Similarly, we can also\n",
    "use a two-dimensional NDArray\n",
    "to specify valid lengths\n",
    "for every row in each matrix example.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "maskedSoftmax(\n",
    "        manager.randomUniform(0, 1, new Shape(2, 2, 4)),\n",
    "        manager.create(new float[][] {{1, 3}, {2, 4}}));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "origin_pos": 12
   },
   "source": [
    "## Additive Attention\n",
    ":label:`subsec_additive-attention`\n",
    "\n",
    "In general,\n",
    "when queries and keys are vectors of different lengths,\n",
    "we can use additive attention\n",
    "as the scoring function.\n",
    "Given a query $\\mathbf{q} \\in \\mathbb{R}^q$\n",
    "and a key $\\mathbf{k} \\in \\mathbb{R}^k$,\n",
    "the *additive attention* scoring function\n",
    "\n",
    "$$a(\\mathbf q, \\mathbf k) = \\mathbf w_v^\\top \\text{tanh}(\\mathbf W_q\\mathbf q + \\mathbf W_k \\mathbf k) \\in \\mathbb{R},$$\n",
    ":eqlabel:`eq_additive-attn`\n",
    "\n",
    "where\n",
    "learnable parameters\n",
    "$\\mathbf W_q\\in\\mathbb R^{h\\times q}$, $\\mathbf W_k\\in\\mathbb R^{h\\times k}$, and $\\mathbf w_v\\in\\mathbb R^{h}$.\n",
    "Equivalent to :eqref:`eq_additive-attn`,\n",
    "the query and the key are concatenated\n",
    "and fed into an MLP with a single hidden layer\n",
    "whose number of hidden units is $h$, a hyperparameter.\n",
    "By using $\\tanh$ as the activation function and disabling\n",
    "bias terms,\n",
    "we implement additive attention in the following.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "/* Additive attention. */\n",
    "public static class AdditiveAttention extends AbstractBlock {\n",
    "\n",
    "    private Linear W_k;\n",
    "    private Linear W_q;\n",
    "    private Linear W_v;\n",
    "    private Dropout dropout;\n",
    "    public NDArray attentionWeights;\n",
    "\n",
    "    public AdditiveAttention(int numHiddens, float dropout) {\n",
    "        W_k = Linear.builder().setUnits(numHiddens).optBias(false).build();\n",
    "        addChildBlock(\"W_k\", W_k);\n",
    "\n",
    "        W_q = Linear.builder().setUnits(numHiddens).optBias(false).build();\n",
    "        addChildBlock(\"W_q\", W_q);\n",
    "\n",
    "        W_v = Linear.builder().setUnits(1).optBias(false).build();\n",
    "        addChildBlock(\"W_v\", W_v);\n",
    "\n",
    "        this.dropout = Dropout.builder().optRate(dropout).build();\n",
    "        addChildBlock(\"dropout\", this.dropout);\n",
    "    }\n",
    "\n",
    "    @Override\n",
    "    protected NDList forwardInternal(\n",
    "            ParameterStore ps,\n",
    "            NDList inputs,\n",
    "            boolean training,\n",
    "            PairList<String, Object> params) {\n",
    "        // Shape of the output `queries` and `attentionWeights`:\n",
    "        // (no. of queries, no. of key-value pairs)\n",
    "        NDArray queries = inputs.get(0);\n",
    "        NDArray keys = inputs.get(1);\n",
    "        NDArray values = inputs.get(2);\n",
    "        NDArray validLens = inputs.get(3);\n",
    "\n",
    "        queries = W_q.forward(ps, new NDList(queries), training, params).head();\n",
    "        keys = W_k.forward(ps, new NDList(keys), training, params).head();\n",
    "        // After dimension expansion, shape of `queries`: (`batchSize`, no. of\n",
    "        // queries, 1, `numHiddens`) and shape of `keys`: (`batchSize`, 1,\n",
    "        // no. of key-value pairs, `numHiddens`). Sum them up with\n",
    "        // broadcasting\n",
    "        NDArray features = queries.expandDims(2).add(keys.expandDims(1));\n",
    "        features = features.tanh();\n",
    "        // There is only one output of `this.W_v`, so we remove the last\n",
    "        // one-dimensional entry from the shape. Shape of `scores`:\n",
    "        // (`batchSize`, no. of queries, no. of key-value pairs)\n",
    "        NDArray result = W_v.forward(ps, new NDList(features), training, params).head();\n",
    "        NDArray scores = result.squeeze(-1);\n",
    "        attentionWeights = maskedSoftmax(scores, validLens);\n",
    "        // Shape of `values`: (`batchSize`, no. of key-value pairs, value dimension)\n",
    "        NDList list = dropout.forward(ps, new NDList(attentionWeights), training, params);\n",
    "        return new NDList(list.head().batchDot(values));\n",
    "    }\n",
    "\n",
    "    @Override\n",
    "    public Shape[] getOutputShapes(Shape[] inputShapes) {\n",
    "        throw new UnsupportedOperationException(\"Not implemented\");\n",
    "    }\n",
    "\n",
    "    @Override\n",
    "    public void initializeChildBlocks(\n",
    "            NDManager manager, DataType dataType, Shape... inputShapes) {\n",
    "        W_q.initialize(manager, dataType, inputShapes[0]);\n",
    "        W_k.initialize(manager, dataType, inputShapes[1]);\n",
    "        long[] q = W_q.getOutputShapes(new Shape[] {inputShapes[0]})[0].getShape();\n",
    "        long[] k = W_k.getOutputShapes(new Shape[] {inputShapes[1]})[0].getShape();\n",
    "        long w = Math.max(q[q.length - 2], k[k.length - 2]);\n",
    "        long h = Math.max(q[q.length - 1], k[k.length - 1]);\n",
    "        long[] shape = new long[] {2, 1, w, h};\n",
    "        W_v.initialize(manager, dataType, new Shape(shape));\n",
    "        long[] dropoutShape = new long[shape.length - 1];\n",
    "        System.arraycopy(shape, 0, dropoutShape, 0, dropoutShape.length);\n",
    "        dropout.initialize(manager, dataType, new Shape(dropoutShape));\n",
    "    }\n",
    "}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "origin_pos": 15
   },
   "source": [
    "Let us demonstrate the above `AdditiveAttention` class\n",
    "with a toy example,\n",
    "where shapes (batch size, number of steps or sequence length in tokens, feature size)\n",
    "of queries, keys, and values\n",
    "are ($2$, $1$, $20$), ($2$, $10$, $2$),\n",
    "and ($2$, $10$, $4$), respectively.\n",
    "The attention pooling output\n",
    "has a shape of (batch size, number of steps for queries, feature size for values).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "NDArray queries = manager.randomNormal(0, 1, new Shape(2, 1, 20), DataType.FLOAT32);\n",
    "NDArray keys = manager.ones(new Shape(2, 10, 2));\n",
    "// The two value matrices in the `values` minibatch are identical\n",
    "NDArray values = manager.arange(40f).reshape(1, 10, 4).repeat(0, 2);\n",
    "NDArray validLens = manager.create(new float[] {2, 6});\n",
    "\n",
    "AdditiveAttention attention = new AdditiveAttention(8, 0.1f);\n",
    "NDList input = new NDList(queries, keys, values, validLens);\n",
    "ParameterStore ps = new ParameterStore(manager, false);\n",
    "attention.initialize(manager, DataType.FLOAT32, input.getShapes());\n",
    "attention.forward(ps, input, false).head();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "origin_pos": 18
   },
   "source": [
    "Although additive attention contains learnable parameters,\n",
    "since every key is the same in this example,\n",
    "the attention weights are uniform,\n",
    "determined by the specified valid lengths.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PlotUtils.showHeatmaps(\n",
    "            attention.attentionWeights.reshape(1, 1, 2, 10),\n",
    "            \"Keys\",\n",
    "            \"Queries\",\n",
    "            new String[] {\"\"},\n",
    "            500,\n",
    "            700);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "origin_pos": 20
   },
   "source": [
    "## Scaled Dot-Product Attention\n",
    "\n",
    "A more computationally efficient\n",
    "design for the scoring function can be\n",
    "simply dot product.\n",
    "However,\n",
    "the dot product operation\n",
    "requires that both the query and the key\n",
    "have the same vector length, say $d$.\n",
    "Assume that\n",
    "all the elements of the query and the key\n",
    "are independent random variables\n",
    "with zero mean and unit variance.\n",
    "The dot product of\n",
    "both vectors has zero mean and a variance of $d$.\n",
    "To ensure that the variance of the dot product\n",
    "still remains one regardless of vector length,\n",
    "the *scaled dot-product attention* scoring function\n",
    "\n",
    "\n",
    "$$a(\\mathbf q, \\mathbf k) = \\mathbf{q}^\\top \\mathbf{k}  /\\sqrt{d}$$\n",
    "\n",
    "divides the dot product by $\\sqrt{d}$.\n",
    "In practice,\n",
    "we often think in minibatches\n",
    "for efficiency,\n",
    "such as computing attention\n",
    "for\n",
    "$n$ queries and $m$ key-value pairs,\n",
    "where queries and keys are of length $d$\n",
    "and values are of length $v$.\n",
    "The scaled dot-product attention\n",
    "of queries $\\mathbf Q\\in\\mathbb R^{n\\times d}$,\n",
    "keys $\\mathbf K\\in\\mathbb R^{m\\times d}$,\n",
    "and values $\\mathbf V\\in\\mathbb R^{m\\times v}$\n",
    "is\n",
    "\n",
    "\n",
    "$$ \\mathrm{softmax}\\left(\\frac{\\mathbf Q \\mathbf K^\\top }{\\sqrt{d}}\\right) \\mathbf V \\in \\mathbb{R}^{n\\times v}.$$\n",
    ":eqlabel:`eq_softmax_QK_V`\n",
    "\n",
    "In the following implementation of the scaled dot product attention, we use dropout for model regularization.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "/* Scaled dot product attention. */\n",
    "public static class DotProductAttention extends AbstractBlock {\n",
    "\n",
    "    private Dropout dropout;\n",
    "    public NDArray attentionWeights;\n",
    "\n",
    "    public DotProductAttention(float dropout) {\n",
    "        this.dropout = Dropout.builder().optRate(dropout).build();\n",
    "        this.addChildBlock(\"dropout\", this.dropout);\n",
    "        this.dropout.setInitializer(new UniformInitializer(0.07f), Parameter.Type.WEIGHT);\n",
    "    }\n",
    "\n",
    "    @Override\n",
    "    protected NDList forwardInternal(\n",
    "            ParameterStore ps,\n",
    "            NDList inputs,\n",
    "            boolean training,\n",
    "            PairList<String, Object> params) {\n",
    "        // Shape of `queries`: (`batchSize`, no. of queries, `d`)\n",
    "        // Shape of `keys`: (`batchSize`, no. of key-value pairs, `d`)\n",
    "        // Shape of `values`: (`batchSize`, no. of key-value pairs, value\n",
    "        // dimension)\n",
    "        // Shape of `valid_lens`: (`batchSize`,) or (`batchSize`, no. of queries)\n",
    "        NDArray queries = inputs.get(0);\n",
    "        NDArray keys = inputs.get(1);\n",
    "        NDArray values = inputs.get(2);\n",
    "        NDArray validLens = inputs.get(3);\n",
    "\n",
    "        Long d = queries.getShape().get(queries.getShape().dimension() - 1);\n",
    "        // Swap the last two dimensions of `keys` and perform batchDot\n",
    "        NDArray scores = queries.batchDot(keys.swapAxes(1, 2)).div(Math.sqrt(2));\n",
    "        attentionWeights = maskedSoftmax(scores, validLens);\n",
    "        NDList list = dropout.forward(ps, new NDList(attentionWeights), training, params);\n",
    "        return new NDList(list.head().batchDot(values));\n",
    "    }\n",
    "\n",
    "    @Override\n",
    "    public Shape[] getOutputShapes(Shape[] inputShapes) {\n",
    "        throw new UnsupportedOperationException(\"Not implemented\");\n",
    "    }\n",
    "\n",
    "    @Override\n",
    "    public void initializeChildBlocks(\n",
    "            NDManager manager, DataType dataType, Shape... inputShapes) {\n",
    "        try (NDManager sub = manager.newSubManager()) {\n",
    "            NDArray queries = sub.zeros(inputShapes[0], dataType);\n",
    "            NDArray keys = sub.zeros(inputShapes[1], dataType);\n",
    "            NDArray scores = queries.batchDot(keys.swapAxes(1, 2));\n",
    "            dropout.initialize(manager, dataType, scores.getShape());\n",
    "        }\n",
    "    }\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "origin_pos": 23
   },
   "source": [
    "To demonstrate the above `DotProductAttention` class,\n",
    "we use the same keys, values, and valid lengths from the earlier toy example\n",
    "for additive attention.\n",
    "For the dot product operation,\n",
    "we make the feature size of queries\n",
    "the same as that of keys.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "queries = manager.randomNormal(0, 1, new Shape(2, 1, 2), DataType.FLOAT32);\n",
    "DotProductAttention productAttention = new DotProductAttention(0.5f);\n",
    "input = new NDList(queries, keys, values, validLens);\n",
    "productAttention.initialize(manager, DataType.FLOAT32, input.getShapes());\n",
    "productAttention.forward(ps, input, false).head();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "origin_pos": 26
   },
   "source": [
    "Same as in the additive attention demonstration,\n",
    "since `keys` contains the same element\n",
    "that cannot be differentiated by any query,\n",
    "uniform attention weights are obtained.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PlotUtils.showHeatmaps(\n",
    "        productAttention.attentionWeights.reshape(1, 1, 2, 10),\n",
    "        \"Keys\",\n",
    "        \"Queries\",\n",
    "        new String[] {\"\"},\n",
    "        500,\n",
    "        700);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "origin_pos": 28
   },
   "source": [
    "## Summary\n",
    "\n",
    "* We can compute the output of attention pooling as a weighted average of values, where different choices of the attention scoring function lead to different behaviors of attention pooling.\n",
    "* When queries and keys are vectors of different lengths, we can use the additive attention scoring function. When they are the same, the scaled dot-product attention scoring function is more computationally efficient.\n",
    "\n",
    "\n",
    "\n",
    "## Exercises\n",
    "\n",
    "1. Modify keys in the toy example and visualize attention weights. Do additive attention and scaled dot-product attention still output the same attention weights? Why or why not?\n",
    "1. Using matrix multiplications only, can you design a new scoring function for queries and keys with different vector lengths?\n",
    "1. When queries and keys have the same vector length, is vector summation a better design than dot product for the scoring function? Why or why not?\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Java",
   "language": "java",
   "name": "java"
  },
  "language_info": {
   "codemirror_mode": "java",
   "file_extension": ".jshell",
   "mimetype": "text/x-java-source",
   "name": "Java",
   "pygments_lexer": "java",
   "version": "14.0.2+12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
